import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import matplotlib.patches as mpatches
import itertools
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
df = pd.read_csv('no_show_appointments.csv')
df.head(3)
## Fixing columns names
df.rename(mapper = lambda x: x.lower().replace('-', '_').replace('id', '_id').replace('day', '_day'),
axis = 1,
inplace= True)
df.head(1)
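As a quick sanity check, the rename mapper above can be exercised on the raw column names (a small sketch; the raw names listed here are assumed from the public no-show dataset):

```python
# The same mapper passed to df.rename above
mapper = lambda x: x.lower().replace('-', '_').replace('id', '_id').replace('day', '_day')

raw = ['PatientId', 'AppointmentID', 'ScheduledDay', 'AppointmentDay', 'No-show']
print([mapper(c) for c in raw])
# ['patient_id', 'appointment_id', 'scheduled_day', 'appointment_day', 'no_show']
```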
df.info()
## Changing the types of patient_id and appointment_id to strings,
## as int or float datatypes are misleading for IDs
df['patient_id'] = df['patient_id'].apply(lambda x: str(int(x)))
df['appointment_id'] = df['appointment_id'].apply(lambda x: str(x))
## unique values for every column
for column in ['diabetes', 'handcap', 'sms_received', 'alcoholism', 'hipertension', 'no_show']:
print('{:<18}'.format(column), pd.unique(df[column]))
def check_for_empty_str(df):
    """
    Check for empty strings in string-typed columns,
    as pandas' null checks only catch NaN/None while
    an empty string is still a value.
    """
    print('\nChecking for empty strings:\n')
    for column in df.columns:
        if df[column].dtype == 'O':
            print('{:<18}'.format(column), df[column].apply(lambda x: x.strip() == '').any())
print(df.isna().any())
check_for_empty_str(df)
## Converting categorical no-show, gender columns to numerical value for analysis
df['no_show'] = (df['no_show'] == 'Yes').astype(int)
df['gender'] = (df['gender'] == 'M').astype(int)
df.head(3)
df.shape
## Getting number of unique values in every columns plus checking for any
## duplicate values in appointment_id (shouldn't be any duplicates)
print(
df.nunique(),
'Duplicates:\n{:<19}{}'.format('appointment_id', df['appointment_id'].duplicated().any()),
sep = '\n\n'
)
df.shape[0]
## checking the number of duplicate rows and
## dropping them since they carry no additional information
print(df.duplicated().sum())
df.drop_duplicates(inplace=True)
df.shape[0]
## duplicates in appointment_id
df.duplicated('appointment_id').sum()
Since appointment_id isn't unique, the column name itself looks suspect: generally speaking, one patient can have multiple appointments, but multiple patients can't share the same appointment. I'll investigate the appointment_id values further.
dup_app_ids = df[df.duplicated('appointment_id', keep = False)]
print(dup_app_ids.shape[0])
dup_app_ids = dup_app_ids.sort_values('appointment_id')
dup_app_ids.head(10)
Looking closely at the appointment_id and patient_id columns in the sample above, their values seem identical, so I'll check whether that holds for the entire dataset.
## Checking if all appointment_ids equals their corresponding patient_ids
(df['patient_id'] == df['appointment_id']).all()
Since the column is redundant, I'll drop it.
df.drop('appointment_id', axis = 1, inplace=True)
df.head(1)
df.shape[0]
df.nunique()
## Rechecking that patient_id covers the entire data.
## Note: checking with isnull gives the same result, but I wanted to observe
## the different chunks of patient_id
n_unique_ids = df.nunique()['patient_id']
print(n_unique_ids)
all_dup = df.duplicated('patient_id', keep = False).sum()
unique_dup_ids = all_dup - df.duplicated('patient_id').sum()
n_unique_ids - unique_dup_ids + all_dup == df.shape[0]
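The arithmetic above leans on the difference between `duplicated(keep=False)` (flag every occurrence) and the default `keep='first'` (flag all but the first). A minimal toy sketch of the two behaviours:

```python
import pandas as pd

s = pd.Series(['a', 'a', 'a', 'b'])
print(s.duplicated().tolist())            # [False, True, True, False] (default keep='first')
print(s.duplicated(keep=False).tolist())  # [True, True, True, False]  (every occurrence)
```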
I came upon the following investigation by chance. While checking duplications of the appointment_id column, I tried to slice the columns down to what I thought were the "inherent" or "slow to change" patient characteristics, namely the 'patient_id', 'gender' and 'age' columns, but I sliced with df.loc[:, :'age'] and found duplicate rows.
To my amazement, this slice included both 'scheduled_day' and 'appointment_day', which are date-and-time columns with very little chance of being duplicated, especially 'scheduled_day'. I then kept widening the slice, including more columns at each step, until the number of duplicated rows dropped once 'sms_received' was included (and with 'no_show' included, duplicates reach 0). That's why I'll eventually drop the first occurrences of such rows, keeping only the last pushed state of 'sms_received'.
Below are the steps I took.
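Note that, unlike positional slicing, `df.loc[:, :end_column]` is inclusive of the end label, which is what makes this column-by-column widening work. A toy illustration:

```python
import pandas as pd

toy = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
print(list(toy.loc[:, :'b'].columns))  # ['a', 'b'] -- the end label is included
print(list(toy.iloc[:, :2].columns))   # ['a', 'b'] -- positional slicing excludes index 2
```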
def get_numof_duplicates_till_column(df, end_column):
    """
    Slice df up to (and including) end_column, check for
    duplicate rows, and return the duplicate mask together
    with the total count of duplicates (all occurrences).
    """
    dup_df = df.loc[:, :end_column].duplicated(keep=False)
    return dup_df, dup_df.sum()
print(
'%-18s' % 'age' + '%d' % get_numof_duplicates_till_column(df, 'age')[1],
'%-18s' % 'handcap' + '%d' % get_numof_duplicates_till_column(df, 'handcap')[1],
'%-18s' % 'sms_received' + '%d' % get_numof_duplicates_till_column(df, 'sms_received')[1],
sep = '\n'
)
I'll drop the rows duplicated across all columns before 'sms_received' (i.e., up to 'handcap'), keeping the last occurrences as they were pushed to the system last.
(The operator may have realized he made a mistake in the 'sms_received' or 'no_show' column and decided to re-enter the data.)
duplicates = df.loc[:, :'handcap'].duplicated(keep = 'last')
print(duplicates.sum())
print('\nSame Size? %s'% (duplicates.shape[0] == df.shape[0]))
df[duplicates].sort_values('patient_id').head(4)
## Dropping the duplicates
df = df[~duplicates]
df.shape
## Double checking for any more duplicates
get_numof_duplicates_till_column(df, 'sms_received')[1]
## Resetting index
df.reset_index(drop = True, inplace=True)
df.head(3)
=========================================================================================================================
The hour of day in appointment_day isn't specified in any row (all are 00:00:00), so the hour of scheduled_day isn't really usable either, since all appointments can't be at the same hour. Likewise, there's no point categorizing by the appointment hour, or by the waiting time of patients who scheduled on the same day as the appointment; neither investigation would be trustworthy. What may matter is the interval between scheduled_day and appointment_day: are they on the same day or not, and is the interval so long that the patient may forget or neglect it?
## Converting dates columns from string to datetime objects
df['scheduled_day'] = pd.to_datetime(df['scheduled_day'].str[:10], format = '%Y-%m-%d')
df['appointment_day'] = pd.to_datetime(df['appointment_day'].str[:10], format = '%Y-%m-%d')
df['appointment_day'].dtype
## Deriving the interval from scheduled_day to appointment_day
df['days'] = df['appointment_day'] - df['scheduled_day']
df['days'] = df['days'].dt.days
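Subtracting two datetime columns yields a Timedelta series; the `.dt.days` accessor then extracts whole days. A quick sketch with hypothetical dates:

```python
import pandas as pd

sched = pd.to_datetime(pd.Series(['2016-04-29', '2016-05-04']))
appt = pd.to_datetime(pd.Series(['2016-05-04', '2016-05-04']))
print((appt - sched).dt.days.tolist())  # [5, 0]
```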
df.describe()
## Checking the number of invalid (negative) values in the age column
df[df['age'] < 0].shape[0]
## Checking if the same patient made other appointments
df[df['patient_id'] == df[df['age'] < 0].reset_index()['patient_id'][0]].shape[0]
Age can't be negative, so we'll remove the corresponding entries, especially since there is only one such entry and its patient id isn't duplicated anywhere else in the data, so there is no way to correct it.
Examining the days column, where min(days) == -6 < 0: some entries are clearly invalid due to technical or human error, as for those entries the appointment_day precedes the scheduled_day, which can't happen. I wouldn't be surprised if none of the corresponding patients attended their appointment. Let's check that anyway.
## Checking attendance for the rows with negative day intervals
print('Do any of the patients with wrong appointment day attend the appointment? \n{} as expected'.format(
(df[df['days'] < 0]['no_show'] == 0).any())
)
## Removing the false rows corresponding to negative days interval
## between the appointment day and scheduled day
df = df.query('days >= 0').reset_index(drop = True)
(df['days'] < 0).any()
## Removing rows corresponding to negative ages
print('%-12s'%'Before',(df['age']<0).any())
df = df[df['age'] >= 0]
print('%-12s'%'After',(df['age']<0).any())
df.shape
## final check
df.describe()
df.head(3)
Questions:
1. Is there a neighborhood with a significant absence rate?
2. Does age affect appointment attendance?
3. Do appointments scheduled far ahead of the appointment day affect attendance?
4. Do people attend more appointments on weekends than on workdays?
5. What about people with special conditions (diabetes, hypertension, handicap)?
6. What about SMS recipients?
def groupby_and_agg_by_column(df, groupby_column, agg_column='no_show',
                              agg_functions=['sum', 'mean', 'count']):
    """
    Group df by groupby_column and aggregate agg_column
    (defaulting in this analysis to the 'no_show' column)
    with agg_functions (defaulting to ['sum', 'mean', 'count']),
    returning the aggregated dataframe.
    """
    group_df = df.groupby(groupby_column, as_index=False).agg({agg_column: agg_functions})
    group_df.columns = group_df.columns.droplevel(0)
    group_df.columns = list([groupby_column] if type(groupby_column) == str else groupby_column) + list(group_df.columns[-3:])
    return group_df
First, I will try to investigate absence via the mean, as the mean reflects the probability of missing an appointment in the neighborhood.
## Grouping by neighborhood, aggregating, then sorting descending by mean and by sum of misses
neigh_df = groupby_and_agg_by_column(df, 'neighbourhood')
neigh_df.sort_values(['mean', 'sum'], ascending=[0,0], inplace=True)
neigh_df.reset_index(drop = True, inplace= True)
neigh_df.head(10)
neigh_df.tail(10)
Although the mean reflects the probability of missing well, some neighborhoods have very few appointments compared to others.
Practically, the mean here doesn't reflect the number of misses. For example, the ILHAS OCEÂNICAS DE TRINDADE neighborhood has 100% misses but represents just 2 appointments; likewise PARQUE INDUSTRIAL has 0% misses but just one appointment.
Such a probability estimate is weak because each additional attended appointment added for the neighborhood would change the mean drastically (e.g., .67, .5, .4, .333, etc. for the ILHAS OCEÂNICAS DE TRINDADE mean). Hence, I will investigate the neighborhoods with the highest number of misses, as their means are more stable.
I will remove any neighborhood with count < 50, treating them as outliers: in these neighborhoods, any change in misses affects the mean drastically.
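One way to quantify how unstable these small-sample means are (a sketch, not part of the original analysis; the sample sizes are illustrative): the standard error of a proportion, sqrt(p(1-p)/n), shrinks with n, so a 20% miss rate estimated from 2 appointments is far noisier than one estimated from 1000.

```python
import numpy as np

def prop_se(p, n):
    """Standard error of a sample proportion p estimated from n observations."""
    return np.sqrt(p * (1 - p) / n)

print(round(prop_se(0.2, 2), 3))     # 0.283 -- huge uncertainty for tiny neighbourhoods
print(round(prop_se(0.2, 1000), 3))  # 0.013 -- stable for well-sampled ones
```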
neigh_df = neigh_df.query('count >= 50').reset_index(drop = True)
neigh_df.describe()
neigh_df.sort_values(['sum', 'mean'], ascending=[0,0], inplace=True)
neigh_df.head(10)
Top 10 neighborhoods with appointments misses.
plt.figure(figsize = (10,10))
ax = plt.subplot(111)
ax.xaxis.set_label_position('top')
ax.xaxis.tick_top()
plt.grid(True, alpha = .5, axis = 'x', zorder = 3.5, c = 'gray', linestyle = '--')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
container = plt.barh(range(0, 10), neigh_df['sum'][:10], zorder = 2.5)
means_max = neigh_df['mean'][:10].max()
X = list(range(10))
opacity = list(np.linspace(1,.2,10))
Y = list(neigh_df['mean'][:10])
order = [x for _,x in sorted(zip(Y,X), reverse = True)]
opacity = dict(zip(order,opacity))
for i in range(len(container.patches)):
container.patches[i].set_alpha(opacity[i])
plt.yticks(range(0, 10), neigh_df['neighbourhood'][:10])
plt.ylabel('Neighbourhood\n', fontsize = 14)
plt.xlabel('Appointments Misses Count\n', fontsize = 14)
plt.title("\n\nTop 10 Appointment Misses by Neighbourhood\n\n\n\n", fontsize = 16)
plt.annotate('The opacity of columns represent the mean of misses\n with respect to maximum mean in the group\n(ITARARE neighborhood)', xy= np.array([850, 2]), xytext=(300,-100),
ha='center', textcoords='offset points', bbox=dict(boxstyle='round,pad=0.5', fc='gray', alpha=0.1),
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.5',color='gray'))
plt.gca().invert_yaxis()
plt.show()
Taking 0.2 as the average mean across all neighborhoods, the means of JARDIM DA PENHA, TABUAZEIRO and JARDIM CAMBURI are below average, suggesting medical awareness in these neighborhoods may be better than in most others. Meanwhile, neighborhoods such as CARATOÍRA, JESUS DE NAZARETH and especially ITARARÉ may reflect some indifference to medical conditions, as their miss means are well above the average, by more than one standard deviation.
This suggests medical awareness campaigns should be concentrated on such areas.
def sytle_plot_spines(*axes):
    """
    Take axes objects and apply my favorite spine style to each.
    """
    for ax in axes:
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.spines['left'].set_alpha(.6)
        ax.spines['bottom'].set_alpha(.6)
First, I'll take a general look at the appointment distributions with respect to age.
plt.figure(figsize = (12, 4))
ax1 = plt.subplot(121)
plt.hist(df.query('no_show == 1')['age'], bins = range(0, 110, 10), edgecolor = 'black', alpha = .6)
plt.title('\n\nMissed Appointments\n\n', fontsize = 14, loc = 'left')
ax2 = plt.subplot(122)
plt.hist(df.query('no_show == 0')['age'], bins = range(0, 110, 10), edgecolor = 'black', alpha = .6)
plt.title('\n\nAttended Appointments\n\n', fontsize = 14, loc = 'left')
sytle_plot_spines(ax1, ax2)
plt.xticks(range(0, 100,10));
It seems most of the misses come from young ages 0-10, but most of the attendances are from that same group.
People aged 10-40 have many misses and not that many attendances compared with the 0-10 group, so I expect the highest miss rates in this region, especially in the 10-20 and 20-30 groups.
I'll categorize the ages into age groups to be closer to real life and analyze the miss means.
## Categorizing the age column into age groups
df['age_group'] = pd.cut(df['age'], bins=[0, 0.6, 2, 12, 18, 25, 40, 60, 200],
include_lowest=True, right = True,
labels=['new born', 'infant', 'child', 'teenager', 'young adult', 'adult', 'middle age', 'old'])
## grouping by age_group
age_df = groupby_and_agg_by_column(df, 'age_group')
age_df
## Ordered age-group labels for plotting
group_order = ('new born', 'infant', 'child', 'teenager', 'young adult', 'adult', 'middle age', 'old')
func = lambda x: '\n' + x
group_order = tuple(func(i) for i in group_order)
plt.figure(figsize = (10, 5))
ax = plt.subplot(111)
sytle_plot_spines(ax)
cont = plt.bar(range(0, len(age_df['age_group']) * 8, 8), age_df['mean'], width = 5, color = '#0073FF')
cont[3].set_color('#75e504')
plt.xticks(range(0, len(age_df['age_group']) * 8, 8), group_order);
plt.xlabel('\nAge Group\n\n', fontsize = 12)
plt.ylabel('Probability of Missing the Appointment\n\n', fontsize = 12)
plt.title('\nMissing rate for each Age Group\n', fontsize = 14)
ax.set_facecolor('#f4f4f9')
plt.show()
It's clear that teenagers, young adults, adults and children have the highest rates, as expected from the initial investigation (ages 10-40), with teenagers showing the highest miss mean. While responsibility for children, and possibly teenagers, may lie with their families, young adults and adults are responsible for themselves, reflecting individuals' own medical awareness at different ages. But could financial problems be the main cause of absence? I'll investigate further using the scholarship column to pinpoint the problem.
## General investigation of the scholarship column
schol_df = groupby_and_agg_by_column(df, 'scholarship')
schol_df
Generally, it seems scholarships don't affect the probability of attendance positively: the miss mean for scholarship recipients is greater than for non-recipients. Is this the case across different age groups?
temp = groupby_and_agg_by_column(df, ['age_group', 'scholarship'])
temp.head(10)
plt.figure(figsize = (10,5))
ax = plt.subplot(111)
ticks = []
labels = []
step = 12
pointer = 0
for i in range(int(temp.shape[0] / 2)):
ticks += [pointer, pointer + 4]
labels += [pointer + 4]
pointer += step
sytle_plot_spines(ax)
bar_cont = plt.bar(ticks, temp['mean'], width = 4, align = 'edge', alpha = .7)
patches = bar_cont.patches
for i in range(len(patches)):
if i % 2 == 0:
patches[i].set_fc('#a1f79b')
else:
patches[i].set_fc('#FFB000')
non_sponsored = mpatches.Patch(color='#a1f79b', label='Not Sponsored')
sponsored = mpatches.Patch(color ='#FFB000', label = 'Sponsored')
plt.legend(handles=[non_sponsored, sponsored], fontsize = 11)
plt.xticks(labels, group_order);
plt.title('\nAppointment Absence Mean by Scholarship\n', fontsize = 14);
plt.xlabel('\nAge Group\n\n', fontsize = 12)
plt.ylabel('Probability of Missing the Appointment\n\n', fontsize = 12)
Although the rates for scholarships given to newborns and infants are better than for those without, scholarships fail to explain the miss rates of the other age groups, meaning there is no simple direct relationship between scholarships and attendance across age groups. That's why I tentatively relate attendance to medical awareness; my results here are provisional, as scholarship might still prove useful for predicting attendance within a neural network or another classification model.
plt.figure(figsize=(6,6))
ax = plt.subplot(111)
g = sns.boxplot(x='no_show', y = 'days', data = df, zorder = 3.6);
g.set(ylim = (-10, 65))
ax.set_facecolor('#efeded')
plt.show()
df[df['days'] > 50].shape[0]
Although there are many possible outliers in the days column, there are too many of them to exclude completely. I'll categorize the days into same day, same week, up to one month, up to two months, and beyond.
df['interval_category'] = pd.cut(df['days'], bins=[0, .5, 7, 30, 60, 1000],
include_lowest=True, right = True,
labels=['same_day', 'same_week', 'same_month', 'two_months', 'beyond'])
df.head(3)
days_df = groupby_and_agg_by_column(df, 'interval_category')
days_df
plt.figure(figsize = (8,4))
sytle_plot_spines(plt.subplot(111))
plt.bar(days_df['interval_category'], days_df['mean'], alpha = .7)
plt.title('\nMiss Rate by Time-to-Appointment Category\n', fontsize = 14);
plt.xlabel('\nTime Category\n\n', fontsize = 12)
plt.ylabel('Probability of Missing the Appointment\n\n', fontsize = 12)
plt.show()
People scheduling the appointment on the same day have a very low miss rate, as expected, and it seems that the longer the interval between scheduling and the appointment, the higher the miss mean. The 'beyond' group catches outliers (intervals over 60 days); although its mean is less than the 'same_month' and 'two_months' means, it is a relatively small group compared to the others, so it could be handled as a special case on its own.
df['day_of_week'] = df['appointment_day'].dt.dayofweek
df.head(3)
import calendar
by_dayofweek_df = groupby_and_agg_by_column(df, 'day_of_week')
day_mapper =dict(zip(range(7), calendar.day_name))
by_dayofweek_df['week_name'] = np.array([day_mapper[x] for x in by_dayofweek_df['day_of_week']])
by_dayofweek_df
Basically, there are no appointments at all on Sundays, and on Saturdays there are too few appointments to compare workdays against weekends. Consequently, there is no point in carrying on with this part of the analysis.
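For reference, pandas follows the Monday=0 … Sunday=6 convention for `dt.dayofweek`, so "no appointments on Sundays" means day 6 never appears in the grouped table. A quick check of the convention:

```python
import pandas as pd

# 2016-05-02 was a Monday, 2016-05-01 a Sunday
print(pd.Timestamp('2016-05-02').dayofweek)  # 0
print(pd.Timestamp('2016-05-01').dayofweek)  # 6
```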
First, I will investigate each column's mean individually, then the interaction between the columns.
def get_probability_of_no_show(df, columns, values):
    """
    Take binary-valued 'columns' and return the miss mean
    for the combination of column values stored in 'values',
    along with the number of appointments matching the query.
    """
    if len(columns) != len(values):
        raise Exception('Number of columns not equal to number of values')
    query = ' and '.join('{} == {}'.format(column, value)
                         for column, value in zip(columns, values))
    temp_df = df.query(query)['no_show']
    return temp_df.mean(), temp_df.count()
## Aggregate mean over entire group categories
columns = ['alcoholism', 'diabetes', 'hipertension']
print('%24s'%'Mean', '%10s'%'Count')
for column in columns:
print('%-18s'%column, '%-10.3f %d'%get_probability_of_no_show(df, [column],[1]))
def label_venn(v, df, columns, repeat):
    """
    Take a venn object (venn2 or venn3), the columns represented
    in the diagram, and label each region of the diagram with its
    probability of missing and its number of samples.
    """
    combinations = list(itertools.product([0, 1], repeat=repeat))
    combinations.remove((0,) * repeat)
    for comb in combinations:
        v.get_label_by_id(''.join(str(x) for x in comb)).set_text(
            '%.3f/%d' % get_probability_of_no_show(df, columns, comb))
## Measuring interaction between each 2 combination
## of the columns 'alcoholism', 'diabetes', 'hipertension'
from matplotlib_venn import venn2, venn2_circles
figure, axes = plt.subplots(1, 3)
figure.set_size_inches((16,5))
index = 0
for comb in list(itertools.combinations(columns, 2)):
v = venn2(subsets=(1,1,1), set_labels = ['\n\n' + i +'\n\n' for i in comb], ax=axes[index])
c = venn2_circles(subsets=(1, 1, 1), linestyle='-', ax = axes[index])
c[0].set_edgecolor('#1ef213')
c[1].set_edgecolor('#29c3ce')
v.get_patch_by_id('A').set_color('#1ef213')
v.get_patch_by_id('B').set_color('#29c3ce')
label_venn(v, df, comb, 2)
index += 1
plt.annotate('\nMean of misses/Num of appointments\n', xy=v.get_label_by_id('100').get_position() - np.array([0, 0.05]), xytext=(-100,110),
ha='center', textcoords='offset points', bbox=dict(boxstyle='round,pad=0.5', fc='gray', alpha=0.1),
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.5',color='gray'))
del index
plt.show()
It seems the miss rates of people with more than one special condition are lower, as with diabetes plus alcoholism; however, that's not the case with hypertension.
Examining the interaction between all three columns at the same time ('alcoholism', 'diabetes', 'hipertension').
from matplotlib_venn import venn3, venn3_circles
plt.figure(figsize=(8,8))
v = venn3(subsets=(1, 1, 1, 1, 1, 1, 1), set_labels = ('Alcoholism', 'Diabetes', 'Hypertension'))
label_venn(v, df, columns, 3)
venn3_circles(subsets=(1, 1, 1, 1, 1, 1, 1), linestyle='--')
plt.title("\n\nAlcoholism, Diabetes, Hypertension Probabilistic Effect on Attendance\n\n", fontsize = 14)
plt.annotate('Mean of misses/Num of appointments', xy=v.get_label_by_id('100').get_position() - np.array([0, 0.05]), xytext=(-170,-70),
ha='center', textcoords='offset points', bbox=dict(boxstyle='round,pad=0.5', fc='gray', alpha=0.1),
arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0.5',color='gray'))
plt.show()
I will examine the remaining variables, handicap and sms_received, to complete the analysis in case I missed anything important; then I will build a simple classification model with scikit-learn.
temp_df = groupby_and_agg_by_column(df, 'handcap')
print(temp_df)
plt.bar(temp_df['handcap'], temp_df['mean']);
As the group sizes for handicap degrees 3 and 4 are very small, we may not consider their means significant.
temp_df = groupby_and_agg_by_column(df, 'sms_received')
print(temp_df)
plt.bar(temp_df['sms_received'], temp_df['mean']);
plt.xticks(range(0,2));
Surprisingly, the proportion of misses among those who received SMS messages is higher than among those who didn't.
import statsmodels.api as sm
print(df.columns)
df.head()
df2 = df[['age', 'days', 'no_show']].copy()
df2.head()
## Getting the correlation coefficient between the age and days columns
np.corrcoef(df2['age'], df2['days'])[0,1]
plt.figure(figsize = (8,6))
sytle_plot_spines(plt.subplot(111))
plt.scatter(df2['age'], df2['days'], alpha = .4)
plt.xlabel('\nAge', fontsize = 12)
plt.ylabel('Days to Appointment\n', fontsize = 12)
plt.title('\nDays to Appointment and Age Relation\n\n', fontsize = 15)
plt.show()
Based on the correlation coefficient and the scatter plot, the days and age columns seem independent.
c = df2['no_show'].apply(lambda x: 'blue' if x == 1 else 'red')
plt.figure(figsize = (10, 7))
sytle_plot_spines(plt.subplot(111))
plt.ylim(0, 120)
plt.scatter(df2['age'], df2['days'], alpha = .4, c = c)
plt.xlabel('\nAge', fontsize = 12)
plt.ylabel('Days to Appointment\n', fontsize = 12)
plt.title('\nDays to Appointment and Age Relation\n\n', fontsize = 15)
showed = mpatches.Patch(color='red', label='Attended')
not_showed = mpatches.Patch(color ='blue', label = 'Missed')
plt.legend(handles = [showed, not_showed],
loc = 'center', bbox_to_anchor=(.9, .3, 0.3, 1),
title = 'No-Show', fontsize = 11)
plt.show()
Although the data are heavily overlapped, there are regions densely packed with misses and others densely packed with attendances. I'm no machine learning expert, just a novice, but one way this classification might work is by dividing the data into smaller regions and classifying each subproblem individually. I'll try classification with sklearn on both the whole data and the subdivided data and compare the accuracy of the classification.
## proportion of appointment in the same day to all appointments
df2.query('days == 0').shape[0]/ df2.shape[0]
print(
'%.3f'%df2.query('days == 0')['no_show'].mean(),
'%.3f'%df2.query('days == 1')['no_show'].mean(), sep = '\n'
)
One important note from the previous analysis: a large proportion of appointments were scheduled on the same day and had a very low miss rate. Looking at the previous plot, a single horizontal line at days = 0, out of more than 100 such lines, contains over a third of the data with very few misses, which may gravely affect the classification of the other points, so I'll check that.
## I will discard data points with days >= 100 (considering them outliers)
## and with age >= 100, to concentrate the classification on the more coherent region
df2 = df2.query('days < 100 and age < 100').copy()
df2.shape
df2['intercept'] = 1
lmodel = sm.Logit(df2['no_show'], df2[['intercept', 'age', 'days']]).fit()
lmodel.summary()
- The near-zero p-values for both the age and days coefficients indicate their significance; neither coefficient can be 0.
- For every one-year decrease in age, the odds of missing the appointment multiply by about 1.00773.
- For every one-day increase in the interval, the odds of missing multiply by about 1.02829.
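The ratio figures quoted above come from exponentiating the fitted log-odds coefficients; with statsmodels this would be `np.exp(lmodel.params)`. A sketch of the conversion for a hypothetical coefficient of 0.0077:

```python
import numpy as np

coef = 0.0077              # hypothetical log-odds coefficient for 'days'
odds_ratio = np.exp(coef)  # multiplicative change in odds per unit increase
print(round(odds_ratio, 5))  # 1.00773
```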
df_exc_day0 = df2.query('days > 0')
lmodel = sm.Logit(df_exc_day0['no_show'], df_exc_day0[['intercept', 'age', 'days']]).fit()
lmodel.summary()
- The near-zero p-values for both the age and days coefficients again indicate their significance; neither coefficient can be 0.
- For every one-year decrease in age, the odds of missing the appointment multiply by about 1.00995.
- For every one-day increase in the interval, the odds of missing multiply by about 1.00844.
## Aggregate columns by days
temp = groupby_and_agg_by_column(df2, 'days')
temp.head()
temp.describe()
# comparing the shift in no_show mean between days
plt.figure(figsize=(6,4))
sytle_plot_spines(plt.subplot(111))
diffs = []
for i in range(len(temp['days']) - 1):
diffs += [temp['mean'][i + 1] - temp['mean'][i]]
print('Diffs Mean: %.7f' % np.mean(diffs))
plt.scatter(temp['days'][1:], diffs)
plt.ylim((-2 * np.std(diffs), 2 * np.std(diffs)))
plt.xlim((0, 100))
plt.title('\nDay to Day Missing Rate difference\n\n', fontsize = 14)
plt.xlabel('\nDays', fontsize = 12)
plt.ylabel('Means Differences\n', fontsize = 12);
Comparing the model with and without day 0, there is a significant change in the days coefficient, from .0285 to .0084. That is reasonable: day 0 holds more than a third of the data, and the no-show mean shifts by .1665 between day 0 and day 1 (from 0.046298 to 0.212791), while the average day-to-day shift is only 0.0015684. Therefore, I will adopt the second model, excluding day 0, since day 0 with its no-show mean of .046 is already well predicted.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
np.random.seed(42)
def train_model(df, X, y, method):
    """
    Train a model on data df with the X columns as predictors
    and the y column as response, using the sklearn model 'method',
    then print precision, recall, accuracy and the confusion matrix.
    """
    X_train, X_test, y_train, y_test = train_test_split(df[X], df[y],
                                                        test_size=.2,
                                                        random_state=42)
    model = method(random_state=42, tol=1e-5)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    print(
        '\n'
        '%-20s' % 'Precision' + '%.4f' % precision_score(y_test, y_predict),
        '%-20s' % 'Recall' + '%.4f' % recall_score(y_test, y_predict),
        '%-20s' % 'Accuracy' + '%.4f' % accuracy_score(y_test, y_predict), '\n',
        'Confusion matrix', confusion_matrix(y_test, y_predict),
        sep='\n'
    )
train_model(df2, ['days', 'age'], 'no_show', LogisticRegression)
train_model(df_exc_day0, ['days', 'age'], 'no_show', LogisticRegression)
With a recall score of 0.000, the model is clearly failing to predict missed appointments. The model with day 0 included, with recall .0243, fails too, but it does better than the one without day 0.
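A near-zero recall usually signals class imbalance: roughly 80% of appointments are attended, so predicting "show" for everyone already maximizes accuracy. One standard mitigation is scikit-learn's `class_weight='balanced'` parameter; a sketch on synthetic data standing in for the appointment features (not the models above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data: ~80% majority class, mimicking the no-show ratio
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight='balanced').fit(X, y)

# Balancing re-weights the minority class, typically trading some
# precision for better recall on the rare "miss" class
print(recall_score(y, plain.predict(X)))
print(recall_score(y, balanced.predict(X)))
```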
train_model(df2, ['days', 'age'], 'no_show', LinearSVC)
train_model(df_exc_day0, ['days', 'age'], 'no_show', LinearSVC)
From the previous two analyses, it seems excluding day 0 was a mistake, so I'll use df2, with day 0 included, to predict misses after dividing the data into 9 smaller regions.
def plot_scatter(df):
c = df['no_show'].apply(lambda x: 'blue' if x == 1 else 'red')
plt.figure(figsize = (3, 3))
sytle_plot_spines(plt.subplot(111))
plt.scatter(df['age'], df['days'], alpha = .4, c = c)
plt.xlabel('\nAge', fontsize = 12)
plt.ylabel('Days to Appointment\n', fontsize = 12)
plt.title('\nDays to Appointment and Age Relation\n\n', fontsize = 15)
plt.show()
## Now I'll try dividing the data
for i in range(33,100,33):
for j in range(33, 100, 33):
temp_df = df2.query('age >= %d and age < %d and days >= %d and days < %d' % (j - 33, j, i - 33, i))
train_model(
temp_df,
['days', 'age'], 'no_show',
LinearSVC
)
plot_scatter(temp_df)
print('==' * 50)
Many of the subgroup classification models are doing poorly, but there is some improvement in recall, as in group one.
Hence, we may further subdivide the groups, keeping a minimum group size and a desired recall ratio as constraints on the algorithm.
Finally, the results obtained here are tentative, and that concludes the analysis.
- Is the data provided by the dataset sufficient to answer your questions?
- Is the size of the dataset sufficient to give a good judgement about the questions you asked?
- What were the major difficulties you encountered in the dataset while performing the analysis?
- Some questions specific to patients missing the appointment can't be answered using this data. Examples:
- What time of day do appointments suffer the highest miss rates (morning, evening, rush hours, etc.)? The specific hours, minutes and seconds of the appointments aren't given; they are all defaulted to 00:00:00.
- Some information that might better predict miss rates for individuals is missing from the data, such as income or level of education.
- The fact that appointment_id was the same as patient_id forces us to rely on the appointment_day column to count a patient's appointments, which is unreliable since a patient could have had multiple appointments on the same day.
- The dataset is near 100,000 records, which is a good size to draw analysis from, but once categorized by some attributes such as neighborhood, deficiencies appear due to the small number of records for a specific neighborhood or medical condition like handicap. Generally, in analysis, the more data, the better the generality.
- Major difficulties:
- In cleaning the data: the duplicate scheduled days. Since every scheduled_day value is specified down to the second (very fine granularity), duplicate entries in that column seemed obscure. After investigation, I realised those duplicate entries shared a patient_id, with the last-pushed entries having a changed value in another column such as sms_received or a medical condition. This can happen, for example, if the operator entered some wrong information, the patient corrected him, and he re-entered the patient's data with the corrected info under the same scheduled day. That's why I chose to keep the last rows.
- In the classification analysis of miss rates based on the independent variables age and days: the data was heavily overlapped. I tried to further categorize the data with one or more additional independent variables such as scholarship, gender or alcoholism, but the data remained heavily overlapped. That's why I removed that step from the analysis and kept it based only on the age and days columns, subdividing the data into smaller regions.